Discussion:
POI 3.10.1 XWPFRun getText() Does Not Return Full Line of Text
Keith Denny
2014-09-05 02:49:52 UTC
Hello,

I am attempting to use POI to support a document/template tool and am
receiving unexpected results when parsing an XWPFDocument.
Specifically, when I review each line of text, the String returned
from the XWPFRun.getText() call is not the same text that is visible in the
actual document. Here are my specific details:

*Simple Use Case*
- Create a MS Word 2010 document, i.e. Test.docx NOTE: Although I am
basically doing something similar to MS templates, I am not using a .dotx
file; rather, my starting point is a .docx file.
- In the document, insert a *<<TAG>>* in the text, such as 'Dear <<CLIENT_NAME>>'
NOTE: In the Word document, the characters 'Dear
<<CLIENT_NAME>>' all appear on a single line
- The *<<TAG>>* is a placeholder that will be dynamically replaced by a
custom document management system. In this case, there is a system entity
tag with the identifier as <<CLIENT_NAME>> and when the document is parsed,
the code will look to see if the entity tag, such as <<CLIENT_NAME>>,
exists in the document and will replace it with a real runtime value.

*Simplified Code:*
// mContent and req come from the surrounding application code.
InputStream in = mContent.getBinaryStream();
XWPFDocument _doc = new XWPFDocument(in);
// The entity map is the same for every run, so fetch it once.
@SuppressWarnings("unchecked")
LinkedHashMap<String, String> _entityMap =
    (LinkedHashMap<String, String>) req.getSession().getAttribute("ENTITY_MAP");
for (XWPFParagraph p : _doc.getParagraphs()) {
    for (XWPFRun r : p.getRuns()) {
        String text = r.getText(0);
        if (text != null) {
            for (String key : _entityMap.keySet()) {
                String tag = key.trim();
                if (text.contains(tag)) {
                    // Replace the tag within this run's own text.
                    text = text.replace(tag, _entityMap.get(key));
                    r.setText(text, 0);
                }
            }
        }
    }
}

*Results:*
One call to r.getText(0) returns only '<<CLIENT_'; therefore, there's no
match with the comparison check of the entity tag of <<CLIENT_NAME>>. The
following call to r.getText(0) returns only 'NAME>>'. Again, obviously, no
match.

Sometimes, r.getText(0) returns <<CLIENT_NAME and leaves the trailing ">>"
for the next call to r.getText(0). Again, obviously, no match.

Sometimes, some tags do get returned by XWPFRun.getText() and the
substitution occurs as planned.

*Questions*

1. If the literal string of characters in the actual MS Word document exists
on one single line of text, why does XWPFRun.getText() return the line as
multiple pieces of text?

2. How do I ensure that I get the actual line, as it exists in the MS Word
document, in POI so I can inspect and replace key text?

Any help would be greatly appreciated. Thank you in advance for your
feedback.

Sincerely,
Keith G. Denny
Nick Burch
2014-09-05 10:14:16 UTC
Post by Keith Denny
*Results:*
One call to r.getText(0) returns only '<<CLIENT_'; therefore, there's no
match with the comparison check of the entity tag of <<CLIENT_NAME>>. The
following call to r.getText(0) returns only 'NAME>>'. Again, obviously, no
match.
This is normal. That's just how the Word file format works. A given run
contains text that is all styled the same. A paragraph is made up of
possibly multiple runs, each run having text of the same style; each
subsequent run may or may not have a different style.

It all depends on the history of the file, and what mood Word was in when
creating it.
Post by Keith Denny
2. How do I ensure that I get the actual line, as it exists in the MS Word
document, in POI so I can inspect and replace key text?
Fetch the text at the paragraph level, then work out which run(s) to
change within that, taking into account that a given bit of text could well be
spread across multiple runs.

Nick
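
To illustrate the suggestion above, here is a minimal sketch of "work out which run(s) to change". It deliberately does not use the POI classes: each run is modelled as a plain String in a List, and the class and method names (CrossRunReplace, replaceFirst) are made up for this example. With POI, the strings would presumably come from XWPFRun.getText(0) and be written back with setText(text, 0). The idea is to search the joined paragraph text, then map the match offsets back onto the individual runs so that untouched runs keep their own text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrossRunReplace {

    /**
     * Replaces the first occurrence of a non-empty tag in the concatenated
     * run texts. The replacement value is written into the run where the
     * match starts; the matched characters are removed from any later runs
     * the match spills into. Returns true if a replacement was made.
     */
    public static boolean replaceFirst(List<String> runs, String tag, String value) {
        String joined = String.join("", runs);
        int start = joined.indexOf(tag);
        if (start < 0) {
            return false;
        }
        int end = start + tag.length(); // exclusive end of the match

        int runStart = 0; // offset of the current run within the joined text
        for (int i = 0; i < runs.size(); i++) {
            String text = runs.get(i);
            int runEnd = runStart + text.length();
            boolean overlapsMatch = runEnd > start && runStart < end;
            if (overlapsMatch) {
                int from = Math.max(start, runStart) - runStart;
                int to = Math.min(end, runEnd) - runStart;
                // Only the run containing the start of the match gets the value.
                String insert = (runStart <= start) ? value : "";
                runs.set(i, text.substring(0, from) + insert + text.substring(to));
            }
            runStart = runEnd;
        }
        return true;
    }

    public static void main(String[] args) {
        // The tag is split across two runs, as described in the thread.
        List<String> runs = new ArrayList<>(
                Arrays.asList("Dear <<CLIENT_", "NAME>>", ","));
        replaceFirst(runs, "<<CLIENT_NAME>>", "Alice");
        System.out.println(String.join("", runs)); // prints "Dear Alice,"
    }
}
```

The run that held the tail of the tag ends up empty rather than being removed, so every other run keeps its position (and, in a real document, its formatting).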
Keith Denny
2014-09-05 18:16:10 UTC
Nick,

Thank you for confirming the functionality. Basically, I'll have to
assemble the Paragraph lines from all the Runs and then inspect the
assembled Paragraph full text for my translation/substitution routine.

In essence, I think I will have to remove all the Runs after assembling
them at runtime, translate/make substitutions, and then add a single Run
back to the Paragraph with the whole text that was assembled. Is there a
limit to the size of a given Run?

Thanks,
Keith
Nick Burch
2014-09-05 18:20:08 UTC
Post by Keith Denny
Thank you for confirming the functionality. Basically, I'll have to
assemble the Paragraph lines from all the Runs and then inspect the
assembled Paragraph full text for my translation/substitution routine.
The paragraph object itself can give you the overall paragraph text, why
not use that?
Post by Keith Denny
In essence, I think I will have to remove all the Runs after assembling
them at runtime, translate/make substitutions, and then add a single Run
back to the Paragraph with the whole text that was assembled. Is there a
limit to the size of a given Run?
Nope, runs are created by Word to handle adjacent blocks of text that need
different formatting, and sometimes when it thinks there's a risk that
they might later need it, or might once have had it... If the paragraph is
supposed to all be the same, there's something to be said for squashing it
down to just one run, then modifying the text in that!

Nick
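
As a rough sketch of the "squash it down to just one run" approach: the example below again models runs as plain Strings rather than POI objects, and SquashRuns / squashAndReplace are hypothetical names. It joins all run texts, applies every entry of an entity map, and leaves a single run holding the result. In POI the equivalent steps would presumably be removing the existing runs with XWPFParagraph.removeRun(int) and adding one with createRun(), but check that against the version you are using.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SquashRuns {

    /**
     * Joins all run texts, substitutes every tag found in the entity map,
     * and replaces the run list with one run containing the final text.
     */
    public static void squashAndReplace(List<String> runs, Map<String, String> entities) {
        String text = String.join("", runs);
        for (Map.Entry<String, String> e : entities.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        runs.clear();
        runs.add(text);
    }

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>(
                Arrays.asList("Dear <<CLIENT_", "NAME>>", ", re: <<ORDER_ID>>"));
        Map<String, String> entities = new LinkedHashMap<>();
        entities.put("<<CLIENT_NAME>>", "Alice");
        entities.put("<<ORDER_ID>>", "42");

        squashAndReplace(runs, entities);
        System.out.println(runs.get(0)); // prints "Dear Alice, re: 42"
    }
}
```

This is simpler than tracking offsets per run, at the cost of any formatting differences between the original runs, so it only suits paragraphs that are meant to be styled uniformly.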
Keith Denny
2014-09-05 18:42:26 UTC
With a Run, I can set the text of the acquired Run. But, I don't see where
I can reset the text of the Paragraph if I get all the Paragraph text from
either getParagraphText or just getText. It would definitely be preferred
if I could do it with a setter such as setParagraphText or setText. Am I
overlooking a method like that?
Nick Burch
2014-09-05 18:50:09 UTC
Post by Keith Denny
With a Run, I can set the text of the acquired Run. But, I don't see
where I can reset the text of the Paragraph if I get all the Paragraph
text from either getParagraphText or just getText. It would definitely
be preferred if I could do it with a setter such as setParagraphText or
setText. Am I overlooking a method like that?
I'm on a train right now, but IIRC there is a method on an XWPF paragraph
that'll zap all the runs and replace them with a single new one with the
given text. Check the source code and you ought to be able to find it!

Nick
Keith G. Denny
2014-09-05 19:36:54 UTC
Good to know. Thank you for your assistance. Enjoy your weekend.

Respectfully,
Keith G. Denny
